CEMA Internship Task

1 Environment set up

# Avoid printing of numbers in scientific format
options(scipen = 999)

# Required Packages
pacman::p_load(tidyverse, janitor, dlookr, naniar, plotly, timetk, trelliscopejs)

# Loading data 
cema <- read_csv("data/cema_internship_task_2023.csv", show_col_types = F)

# Set default plotting theme
theme_set(theme_minimal())

2 Some Data cleaning

cema_clean <- cema %>% 
  # Cleaning column names to make working with the dataset easier
  clean_names() %>% 
  # convert the period and county columns to their correct data type
  mutate(
    period = my(period),
    county = as.factor(county)
    ) %>% 
  # remove duplicates rows if any
  distinct()

3 Exploratory analysis

3.1 Dataset Structure

We have 1410 rows and 11 variables in the data as the ouput below shows.

# Structure
cema %>% glimpse()
## Rows: 1,410
## Columns: 11
## $ period                     <chr> "Jan-23", "Jan-23", "Jan-23", "Jan-23", "Ja…
## $ county                     <chr> "Baringo County", "Bomet County", "Bungoma …
## $ `Total Dewormed`           <dbl> 3659, 1580, 6590, 7564, 1407, 3241, 6751, 4…
## $ `Acute Malnutrition`       <dbl> 8, NA, 24, NA, NA, 72, 250, 9, 26, 104, 36,…
## $ `stunted 6-23 months`      <dbl> 471, 1, 98, 396, 92, 326, 40, 209, 51, 319,…
## $ `stunted 0-<6 months`      <dbl> 34, 3, 154, 143, 71, 86, 13, 87, 6, 102, 27…
## $ `stunted 24-59 months`     <dbl> 380, NA, 23, 111, 5, 24, 99, 58, 50, 155, 2…
## $ `diarrhoea cases`          <dbl> 2620, 1984, 4576, 2239, 2739, 1376, 2314, 2…
## $ `Underweight 0-<6 months`  <dbl> 85, 41, 231, 251, 57, 141, 223, 140, 13, 13…
## $ `Underweight 6-23 months`  <dbl> 739, 86, 315, 608, 104, 544, 1856, 298, 180…
## $ `Underweight 24-59 Months` <dbl> 731, 16, 120, 125, 21, 160, 1833, 84, 271, …

3.2 Exploring Missingness

The plot below shows the percentage of missing values in each variable in our dataset. Most values are missing in the variable acute_malnutrition (about 25%). Other variables with missing values are stunted 0-<6 months, stunted 24-59 months, and stunted 6-23 months with 1.35%, 0.99%, and 0.78% missingness respectively. You can hover over the plot to see these values.

ggplotly(
  p = cema_clean %>% 
  gg_miss_var(show_pct = T)+
  labs(title = "Percentage of Missing Values per Variable") + theme_bw()
)

3.3 Univariate Data EDA

3.3.1 Lets compute descriptive statistics to help determine the distribution of numerical variables.

We can see the variable stunted_0_6_months has a large positive skewness. For modeling reasons we should consider a log or sqrt transformation on this variable to follow the normal distribution.

cema_clean %>% describe()

We can take the above analysis further by grouping the data by county and computing the descriptive statistics.

cema_clean %>% 
  group_by(county) %>% 
  describe() %>% 
  arrange(described_variables, desc(skewness))

3.3.2 Visualization of normality of numerical variables

We can see that all of the numeric variables in the dataset are positively skilled. We also see how the ditributions of these would look like when square root and log transformations are applied.

cema_clean %>% 
  plot_normality()

3.4 EDA of bivariate data

3.4.1 Calculation of correlation coefficient

The following is a correlation matrix show the association of the numeric variables in the dataset. All of them are positively correlated.

cema_clean %>% 
  correlate() %>% 
  plot()

## Analysis of Numeric variables over time

Lastly, as part of our exploratory analysis, we can leverage the period column to explore how the numeric indicators are changing over time by county. We can do this by plotting each numeric indicator against time, for each county. In the chunk below we filter for rows where county is Embu and plot the total dewormed children over time. The plot shows a relatively constant trend with conspicuous hikes in every May and November of every year.

ggplotly(cema_clean %>% 
  filter(county == "Embu County") %>% 
  ggplot(aes(period, total_dewormed))+
  geom_line()+
  geom_point()+
  labs(
    title = "Total Dewormed Children Over Time", x = "Date", y = "Total Dewormed"
  )+
  theme_bw()
)

Since we have several counties (47) in the data, visualizing all of them in one plot will be will create a visual hard to interpret. We can leverage some javascript to scale the faceting to all 47 counties and indicators. This will create an interactive display which one can search, sort, and filter for any indicator by county.

cema_clean %>% 
  mutate(county = str_remove(county, " County")) %>% 
  pivot_longer(
    3:last_col(), names_to = "indicator", values_to = "value"
  ) %>% 
  unite("analytic", county, indicator, sep = " - ") %>% 
  group_by(analytic) %>% 
  plot_time_series(
    period, value, 
    .interactive = F, .x_intercept = "Date", .y_lab = "Value",
    .title = "General Trend by County and Indicator", 
    .smooth = F
    )+
  geom_point()+
  # facet by all groups in the pivoted data
  facet_trelliscope(~ analytic, scales = "free", width = 1000, nrow = 3, ncol = 3, as_plotly = T, path = "./")

4 Research Question

Is there any association between the occurrence of diarrhea cases in children under 5 years and the number of children dewormed in each county?

4.1 Correlation between the occurrence of diarrhea cases and the number of children dewormed in each county

In the code chunk below we first remove the unnecessary county string in the county column, then we filter out rows with missing values in the diarrhoea_cases and total_dewormed columns, as we need complete data to calculate the correlation.

After that, we group the filtered data by county and calculate the correlation coefficient (Pearson correlation) between diarrhoea_cases and total_dewormed using the cor function.

cema_clean %>% 
  mutate(county = str_remove(county, " County")) %>% 
  drop_na(diarrhoea_cases, total_dewormed) %>% 
  group_by(county) %>%
  summarize(correlation = cor(diarrhoea_cases, total_dewormed, use = "complete.obs")) %>% 
  ggplot(aes(fct_reorder(county, correlation), correlation))+
  geom_col(fill = "steelblue")+
  coord_flip()+
  theme_bw()+
  labs(
    title = "Association between the occurrence of diarrhea cases in children under 5 years and the number of \n children dewormed by county",
    x = "County", y = "Correlation Coefficient"
  )+
  theme(
    axis.text = element_text(face = "bold"),
    plot.title = element_text(vjust = .5, size = 14, face = "bold", colour = "steelblue")
  )

4.1.1 Interpretation

We see a correlation coefficient close to 1 (positive) in counties like Samburu, Busia, Kisii and Turkana. This suggests a strong positive association between Diarrhoea casess and the number of dewormed children in these counties. In other words, as the number of dewormed children increase in these counties, the number of diarrhea cases among children under 5 years will most likely increase.

We also observer a weak negative correlation coefficient in counties like Nyeri, Kakamega, Tharaka Nithi and Kitui. This suggests a weak negative association between Diarrhoea Cases and the number of dewormed children in these counties. This means as the number of dewormed children increase in these counties, the number of diarrhea cases among children under 5 years will most likely decrease.